Project Overview¶
The primary objective of this project is to evaluate the employment and unemployment rates within the immigrant labor force in Canada, considering factors such as the country or continent of origin and gender.
The data used for this project was obtained from the Government of Canada Open Data portal and is available here. The data was collected for Canada, excluding the territories.
The downloadable version of the project can be found on my GitHub repository. It includes the Jupyter Notebook file and the data in CSV format.
1. Data pre-processing¶
1.1. Exploration¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("data/14100089.csv")
data.head(3)
| | REF_DATE | GEO | DGUID | Immigrant status | Country of birth | Labour force characteristics | Sex | Age group | UOM | UOM_ID | SCALAR_FACTOR | SCALAR_ID | VECTOR | COORDINATE | VALUE | STATUS | SYMBOL | TERMINATED | DECIMALS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006 | Canada | 2016A000011124 | Total population | Total population | Population | Both sexes | 15 years and over | Persons | 249 | thousands | 3 | v53056904 | 1.1.1.1.1.1 | 26121.7 | NaN | NaN | NaN | 1 |
| 1 | 2006 | Canada | 2016A000011124 | Total population | Total population | Population | Both sexes | 25 to 54 years | Persons | 249 | thousands | 3 | v53056905 | 1.1.1.1.1.2 | 14115.7 | NaN | NaN | NaN | 1 |
| 2 | 2006 | Canada | 2016A000011124 | Total population | Total population | Population | Males | 15 years and over | Persons | 249 | thousands | 3 | v53056906 | 1.1.1.1.2.1 | 12854.2 | NaN | NaN | NaN | 1 |
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28512 entries, 0 to 28511
Data columns (total 19 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   REF_DATE                      28512 non-null  int64
 1   GEO                           28512 non-null  object
 2   DGUID                         28512 non-null  object
 3   Immigrant status              28512 non-null  object
 4   Country of birth              28512 non-null  object
 5   Labour force characteristics  28512 non-null  object
 6   Sex                           28512 non-null  object
 7   Age group                     28512 non-null  object
 8   UOM                           28512 non-null  object
 9   UOM_ID                        28512 non-null  int64
 10  SCALAR_FACTOR                 28512 non-null  object
 11  SCALAR_ID                     28512 non-null  int64
 12  VECTOR                        28512 non-null  object
 13  COORDINATE                    28512 non-null  object
 14  VALUE                         27983 non-null  float64
 15  STATUS                        529 non-null    object
 16  SYMBOL                        0 non-null      float64
 17  TERMINATED                    0 non-null      float64
 18  DECIMALS                      28512 non-null  int64
dtypes: float64(3), int64(4), object(12)
memory usage: 4.1+ MB
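The summary above reports 27983 non-null entries in VALUE out of 28512, and 529 non-null entries in STATUS. Since 28512 - 27983 = 529, it is plausible (though not proven by the counts alone) that the rows with a missing VALUE are exactly the rows that carry a STATUS flag. A minimal sketch of how this could be checked is shown below; it runs on a small toy frame, and the "F" flag and all numbers in it are illustrative assumptions rather than values taken from the real file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real table; the values and the "F" flag are illustrative only.
toy = pd.DataFrame({
    "VALUE": [57.3, np.nan, 60.7, np.nan],
    "STATUS": [np.nan, "F", np.nan, "F"],
})

missing_value = toy["VALUE"].isna()   # True where VALUE is missing
has_status = toy["STATUS"].notna()    # True where a STATUS flag is present

# Cross-tabulate the two conditions to see whether they coincide.
overlap = pd.crosstab(missing_value, has_status)
print(overlap)

# True only when every missing VALUE carries a STATUS flag and vice versa.
flags_explain_gaps = bool((missing_value == has_status).all())
print(flags_explain_gaps)
```

On the real data, the same check can be run by replacing `toy` with the `data` frame loaded above.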
1.2. Cleaning¶
We have a total of 19 columns, not all of which are necessary here. Let's narrow them down to the most essential ones. We will consider the following columns:
Immigrant status: This represents the status of immigrants and includes the following values: 'Total population', 'Landed immigrants', 'Immigrants, landed 5 or less years earlier', 'Immigrants, landed more than 5 to 10 years earlier', 'Immigrants, landed more than 10 years earlier', and 'Born in Canada'.
Country of birth: This includes "Total population", "Canada", "North America", "Latin America", "Europe", "Africa", and "Asia".
Labour force characteristics: These include 'Population', 'Labour force', 'Employment', 'Unemployment', 'Not in labour force', 'Unemployment rate', 'Participation rate', and 'Employment rate'.
VALUE: This column contains the values of the labour force characteristics based on all the other features.
We will also use Sex, Age group and REF_DATE.
UOM (unit of measure) and SCALAR_FACTOR describe the units in which VALUE is expressed. UOM can be 'Persons' or 'Percentage', while SCALAR_FACTOR can be 'thousands' or 'units'.
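To make the role of SCALAR_FACTOR concrete, here is a minimal sketch that converts VALUE into absolute numbers. The toy rows are illustrative, the column name `VALUE_ABS` is hypothetical, and the pairing of 'Percentage' with 'units' is an assumption that should be verified against the real file.

```python
import pandas as pd

# Illustrative rows only; the Percentage/units pairing is an assumption.
toy = pd.DataFrame({
    "Labour force characteristics": ["Population", "Employment rate"],
    "UOM": ["Persons", "Percentage"],
    "SCALAR_FACTOR": ["thousands", "units"],
    "VALUE": [26121.7, 57.3],
})

# Multiplier implied by each SCALAR_FACTOR label.
multiplier = toy["SCALAR_FACTOR"].map({"thousands": 1_000, "units": 1})

# VALUE_ABS is a hypothetical helper column: persons as a head count, rates unchanged.
toy["VALUE_ABS"] = toy["VALUE"] * multiplier
print(toy[["UOM", "SCALAR_FACTOR", "VALUE", "VALUE_ABS"]])
```

Keeping UOM and SCALAR_FACTOR in the working dataframe makes it harder to mix head counts (in thousands) with rates (in percent) by accident.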
# create a new dataframe using only the needed columns
work_force = pd.read_csv(
    "data/Labour_force_by_country.csv",
    usecols=["REF_DATE", "Immigrant status", "Country of birth",
             "Labour force characteristics", "Sex", "Age group",
             "VALUE", "UOM", "SCALAR_FACTOR"],
)
work_force.head(3)
| | REF_DATE | Immigrant status | Country of birth | Labour force characteristics | Sex | Age group | UOM | SCALAR_FACTOR | VALUE |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006 | Total population | Total population | Population | Both sexes | 15 years and over | Persons | thousands | 26121.8 |
| 1 | 2006 | Total population | Total population | Population | Both sexes | 25 to 54 years | Persons | thousands | 14115.8 |
| 2 | 2006 | Total population | Total population | Population | Males | 15 years and over | Persons | thousands | 12854.2 |
2. Employment rate of immigrants by Country in 2021¶
We need only one labour force characteristic here: the employment rate. We will consider the entire population of landed immigrants, both male and female, in the age group 15 years and over.
# set the index to the date so we don't have to reset it every time
work_force.set_index('REF_DATE', inplace=True)
#-- We set up the first labour force characteristic. We don't need data from the entire population here
employment_rate = work_force.loc[(work_force["Labour force characteristics"] == 'Employment rate') &
(work_force['Sex']=='Both sexes') &
(work_force['Age group']=='15 years and over') &
(work_force['Immigrant status']=='Landed immigrants')]
#employment_rate.head()
#-- what we need so far
print(employment_rate[['Country of birth', 'VALUE']].head())
         Country of birth  VALUE
REF_DATE
2006     Total population   57.3
2006        North America   60.7
2006        Latin America   67.3
2006               Europe   49.8
2006               Africa   59.6
We create a pivot table whose columns are the items in 'Country of birth' and whose cells hold the corresponding 'VALUE', and then make some plots.
employ_rate_by_country = pd.pivot_table(employment_rate, values='VALUE', index=['REF_DATE'],
columns=['Country of birth'])
employ_rate_by_country.head()
| Country of birth | Africa | Asia | Europe | Latin America | North America | Total population |
|---|---|---|---|---|---|---|
| REF_DATE | ||||||
| 2006 | 59.6 | 60.7 | 49.8 | 67.3 | 60.7 | 57.3 |
| 2007 | 61.3 | 61.5 | 50.1 | 64.4 | 61.2 | 57.6 |
| 2008 | 61.6 | 61.1 | 49.3 | 64.1 | 62.0 | 57.3 |
| 2009 | 60.9 | 58.9 | 48.3 | 62.6 | 55.6 | 55.6 |
| 2010 | 61.9 | 58.7 | 49.4 | 61.3 | 58.3 | 56.0 |
employ_rate_by_country.loc[2021].plot(kind='bar',stacked=False,figsize=(12, 6))
plt.title('Employment rate by country in 2021')
plt.ylabel('Employment rate (%)')
plt.xlabel('Country')
plt.show()
#work_force.set_index('REF_DATE', inplace=True)
labour_force = work_force.loc[(work_force["Labour force characteristics"] == 'Labour force') &
(work_force['Sex']=='Both sexes') &
(work_force['Age group']=='15 years and over') &
(work_force['Immigrant status']=='Landed immigrants')]
# Labour force by country in 2021
lf_by_country = labour_force.loc[2021, ['Country of birth', 'VALUE']]
# we can reset the index and drop total population
lfc_data = lf_by_country[lf_by_country['Country of birth'] != 'Total population'].reset_index(drop=True)
lfc_data
| | Country of birth | VALUE |
|---|---|---|
| 0 | North America | 150.7 |
| 1 | Latin America | 612.2 |
| 2 | Europe | 1015.0 |
| 3 | Africa | 597.7 |
| 4 | Asia | 3059.3 |
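The table above gives the labour force in thousands of persons. To compare regions more easily, the values can be turned into shares. The sketch below re-creates the table by hand and computes each region's percentage of the sum of the five listed regions; the column name `share_pct` is hypothetical, and the denominator is the sum of these five rows, which is not guaranteed to equal the 'Total population' figure in the source data.

```python
import pandas as pd

# Values copied from the table above (thousands of persons, 2021).
lfc = pd.DataFrame({
    "Country of birth": ["North America", "Latin America", "Europe", "Africa", "Asia"],
    "VALUE": [150.7, 612.2, 1015.0, 597.7, 3059.3],
})

# Share of each region in the sum of the five listed regions, in percent.
lfc["share_pct"] = (lfc["VALUE"] / lfc["VALUE"].sum() * 100).round(1)

print(lfc.sort_values("share_pct", ascending=False))
```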
Now we get the geographical data¶
I use the World Continents data. In the GeoJSON file I downloaded, the continents are listed as follows: "Africa", "Asia", "Australia", "Oceania", "South America", "Antarctica", "Europe" and "North America". Our data has "Latin America", which needs to be changed to "South America". I also decided to remove North America from the data, for two reasons: its labour force is significantly lower than that of the other regions, and Canada's geolocation data is included in North America's, so Canada itself would be visualized as part of the North American immigration data, which we don't want.
import folium
import json
# Getting both data ready
with open('World_Continents.geojson', 'r') as jsonFile:
    geodata = json.load(jsonFile)
lfc_data = lfc_data[lfc_data['Country of birth'] != 'North America'].replace('Latin America', 'South America')
#lfc_data
m = folium.Map(location=[-30, 50], zoom_start=2)
folium.Choropleth(
geo_data=geodata,
name="choropleth",
data=lfc_data,
columns=["Country of birth", "VALUE"],
key_on="feature.properties.CONTINENT",
fill_color="RdYlGn",
fill_opacity=0.9,
line_opacity=0.2,
legend_name='Distribution of labour force (in Thousands) of immigrants in Canada by continent of origin in 2021',
).add_to(m)
folium.LayerControl().add_to(m)
# Display
m
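The choropleth joins the dataframe to the GeoJSON by name, through `key_on="feature.properties.CONTINENT"`, so a name that differs between the two sources would not be matched. A quick set comparison before plotting can catch this. The sketch below uses a hypothetical miniature GeoJSON with only the `CONTINENT` property in place of the real World_Continents.geojson file, and the variable names `unmatched_in_data` and `unshaded_on_map` are illustrative.

```python
import pandas as pd

# Hypothetical miniature GeoJSON; the real file is World_Continents.geojson.
geodata = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"CONTINENT": name}, "geometry": None}
        for name in ["Africa", "Asia", "Europe", "South America", "North America"]
    ],
}

# Data after renaming Latin America and dropping North America.
lfc_data = pd.DataFrame({
    "Country of birth": ["South America", "Europe", "Africa", "Asia"],
    "VALUE": [612.2, 1015.0, 597.7, 3059.3],
})

geo_names = {f["properties"]["CONTINENT"] for f in geodata["features"]}
data_names = set(lfc_data["Country of birth"])

unmatched_in_data = data_names - geo_names  # rows with no polygon to attach to
unshaded_on_map = geo_names - data_names    # polygons with no value in the data
print(unmatched_in_data, unshaded_on_map)
```

An empty `unmatched_in_data` set means every row in the dataframe has a polygon with the same name.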
4. Unemployment rate in 2021¶
Note: A landed immigrant is a person who has been granted the right to live in Canada permanently by immigration authorities.
unemployment_rate = work_force.loc[(work_force['Age group']=='15 years and over') &
(work_force['Country of birth'] == 'Total population')&
(work_force['Labour force characteristics'] == 'Unemployment rate') &
(work_force['Sex'] != 'Both sexes')]
unemployment_rate = unemployment_rate.loc[2021, ['Immigrant status', 'Sex', 'VALUE']]
unemployment_rate.set_index('Immigrant status', inplace=True)
unemploy_rate_pivot = pd.pivot_table(unemployment_rate, values='VALUE', index=['Immigrant status'],
columns=['Sex'])
unemploy_rate_pivot
| Sex | Females | Males |
|---|---|---|
| Immigrant status | ||
| Born in Canada | 6.5 | 7.6 |
| Immigrants, landed 5 or less years earlier | 13.0 | 7.3 |
| Immigrants, landed more than 10 years earlier | 7.9 | 7.8 |
| Immigrants, landed more than 5 to 10 years earlier | 10.2 | 9.3 |
| Landed immigrants | 9.0 | 8.0 |
| Total population | 7.2 | 7.7 |
unemploy_rate_pivot.plot(kind='bar',stacked=False,figsize=(20, 10))
plt.title('Unemployment rate based on gender and immigration status in 2021')
plt.ylabel('Unemployment rate (%)')
plt.xlabel('Immigration Status')
plt.show()
Observations:¶
In 2021, the unemployment rate among landed immigrants (9.0% for females, 8.0% for males) was higher than among individuals born in Canada (6.5% and 7.6%). Within every immigrant category the rate is higher for women than for men, most notably among those who landed 5 or fewer years earlier (13.0% versus 7.3%), while among the Canadian-born it is lower for women. The rates for immigrants who landed more than 10 years earlier (7.9% and 7.8%) are the closest to those of the total population (7.2% and 7.7%). This suggests that it may take approximately a decade for immigrants to fully integrate into the labor force.
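The difference between female and male unemployment rates can be computed directly from the pivot table. The sketch below re-creates the 2021 table by hand so that it runs on its own; the column name `gap_f_minus_m` is hypothetical.

```python
import pandas as pd

# 2021 unemployment rates (%) copied from the pivot table above.
pivot = pd.DataFrame(
    {
        "Females": [6.5, 13.0, 7.9, 10.2, 9.0, 7.2],
        "Males": [7.6, 7.3, 7.8, 9.3, 8.0, 7.7],
    },
    index=pd.Index(
        [
            "Born in Canada",
            "Immigrants, landed 5 or less years earlier",
            "Immigrants, landed more than 10 years earlier",
            "Immigrants, landed more than 5 to 10 years earlier",
            "Landed immigrants",
            "Total population",
        ],
        name="Immigrant status",
    ),
)

# Positive values mean the female rate is higher than the male rate.
pivot["gap_f_minus_m"] = (pivot["Females"] - pivot["Males"]).round(1)

print(pivot.sort_values("gap_f_minus_m", ascending=False))
```

In the notebook itself, the same column can be added to `unemploy_rate_pivot` without rebuilding the table.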